1 Morning

1.1 Basic usage

TIPS

  • Use Tab to auto-complete
  • Use the up arrow to recall the previous command

1.1.1 Create a new project in RStudio

  1. From the File menu select New Project…
  2. In the New Project dialog select New Directory
  3. Select New Project as the project type.
  4. Give the project a name, e.g. “Workshop”, and click the Create Project button.

1.1.2 Create a new script.

  1. From the File menu select New File > R Script
  2. Save file as “my_Rscript.R”

Write the following commands in the code editor of RStudio and run them using the Run icon.

1.1.3 Creating variables

Important notes:

  • Assign a value using an “arrow” (<-) or equals (=). These mean the same thing.
    • The <- arrow is the “standard” in R, so we will use that.
  • Basic variable types:
    • Numeric: 1, or 2.1, or 1.63e-5
    • Character: "cat" or "dog"
    • Logical (Boolean): TRUE or FALSE

Look at the Environment panel (top right) as you run these commands: you will see the variables appear as you define them.

  1. Create different types of variables. Note that a variable name must not start with a number.
    • You can use <- or = for assignment of values.
    • For boolean values, you can use T or F as shortcuts for TRUE or FALSE
a <- 100
b <- "hello"
c <- TRUE
d <- FALSE
  2. Just type the variable name to see its value
a
## [1] 100
  3. You can see the type of a variable by using class()
class(b)
## [1] "character"

1.1.4 Multi-valued (collection) variables

We have several ways of creating collections of multiple values in a single variable.

1-dimensional:

  • Vector: multiple values of the same type, e.g. (1, 2, 3) or ("x","y","z")
  • List: multiple values of different types, e.g. (1, "cow", TRUE)

2-dimensional:

  • Matrix: table (rows by columns), all of the same type
  • Data frame: table (rows by columns), different columns can have different types
  1. Create a vector of numbers
x <- c(6,5,4)
x
## [1] 6 5 4
  2. Create a vector of character values.
y <- c("mouse", "cat", "dog")
y
## [1] "mouse" "cat"   "dog"
  3. Create a list of different things.
    • It can have ANY combination of objects you want.
    • In this example, the values are numeric, character, Boolean and the last element is a vector of numeric values.
z <- list(4, "cat", TRUE, c(1,2))
z
## [[1]]
## [1] 4
## 
## [[2]]
## [1] "cat"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1 2
  4. Create a matrix.
    • We use the nrow parameter to tell it to create a matrix with three rows.
    • The matrix() function will fill in the data column-wise.
    • 1:6 is a shortcut for a vector of all numbers from 1 to 6
A <- matrix(c(1:6), nrow=3)
A
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
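By default matrix() fills column-wise, as seen above; the byrow argument switches to row-wise filling. A small sketch:

```r
# byrow=TRUE fills the values across each row instead of down each column
B <- matrix(1:6, nrow=3, byrow=TRUE)
B
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
```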
  5. Create a data.frame.
    • A data.frame can combine vectors of different data types, but they MUST be the same length.
    • We give the data.frame() function vectors for each of the columns, in their column order.
df <- data.frame(x, y)
df
  6. Check the size of variables:
    • length(): length of vector or list
    • dim(): number of rows and columns of a matrix or data frame
      • nrow() and ncol(): just the number of rows or columns
length(x)
## [1] 3
dim(A)
## [1] 3 2
nrow(A)
## [1] 3

1.1.5 Basic operations

Arithmetic:

  • Addition and subtraction: 1 + 2 ; 6 - 5
  • Multiplication and division: x * y ; 3 / 4

Boolean:

  • AND: a & b
  • OR: a | b

Set membership: %in%

Sorting objects: using order()

Transpose a matrix: swap the rows and columns, t() function

  1. Arithmetic on single values
1 + 1
## [1] 2
12/3
## [1] 4
  2. Arithmetic on a vector or matrix
    • First we look at the current values
    • Next is element-wise addition
    • Lastly is element-wise multiplication
A
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
A + A 
##      [,1] [,2]
## [1,]    2    8
## [2,]    4   10
## [3,]    6   12
A * A
##      [,1] [,2]
## [1,]    1   16
## [2,]    4   25
## [3,]    9   36
  3. Boolean operations
    • T and F are shortcuts for TRUE and FALSE
TRUE & FALSE
## [1] FALSE
T | F
## [1] TRUE
  4. Set membership: %in% operator
y
## [1] "mouse" "cat"   "dog"
"cat" %in% y
## [1] TRUE
"kitten" %in% y
## [1] FALSE
y %in% "cat"
## [1] FALSE  TRUE FALSE
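The logical vector returned by %in% can also be used directly for subsetting; a quick sketch with the same y vector:

```r
y <- c("mouse", "cat", "dog")
# Keep only the elements that appear in the reference set
y[y %in% c("cat", "dog")]
## [1] "cat" "dog"
```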
  5. Sorting
    • order() returns the INDICES that put the values in sorted order
    • Use order() as the selection (in []) to obtain a sorted collection.
y
## [1] "mouse" "cat"   "dog"
order(y)
## [1] 2 3 1
y[order(y)] 
## [1] "cat"   "dog"   "mouse"
  • Sort a data frame alphabetically by the second column
df[ order(df[,2]), ]
  6. Transpose a matrix
A
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
t(A)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

1.1.6 Combining and extracting elements/items from objects

Combine objects:

  • Concatenate vectors or lists: c(vec1 ,vec2), c(list1, list2)
  • Combine matrices or data frames by column: cbind(A, B)
  • Combine matrices or data frames by row: rbind(A, B)

Extracting elements/items. Note that in R, the indexing starts at 1: vec[1] is the first item in a collection, vec[2] is the second, etc.

  • Get an element of a vector or list: vec[1]
    • NOTE: list[1] gives you a list with just the first element. list[[1]] gives you the value of the first item in the list.
  • Get an element of a matrix or data frame: matrix[1,2] (row, then column)
    • Get all items from first row: matrix[1, ] (leave blank after the comma)
    • Get all items from first column: matrix[, 1] (leave blank before the comma)
  • Specific to data frames:
    • df[1]: gives you a data frame with just the first column
    • df[[1]]: gives you a vector of the first column
  1. Combine vectors
c( c(1,2,3), 4:8 )
## [1] 1 2 3 4 5 6 7 8
  2. Combine lists
c( list("a","b"), list(1,c(2,3)) )
## [[1]]
## [1] "a"
## 
## [[2]]
## [1] "b"
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 2 3
  3. Combine matrices.
    • First by column
    • Second by row
cbind(A, A)
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    1    4
## [2,]    2    5    2    5
## [3,]    3    6    3    6
rbind(A, A)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
## [4,]    1    4
## [5,]    2    5
## [6,]    3    6
  4. Extract items from a vector
    • First, just the first element
    • Then the first two elements. We provide a vector of indices to select.
x[1]
## [1] 6
x[c(1,2)]
## [1] 6 5
  5. Extract from a matrix
    • First, just the element in the first row and first column.
    • Then, the first two rows of the first column. This will return a vector.
    • Same as the previous, but returning a matrix (do not simplify).
    • Lastly, the first two rows with all columns.
A[1,1]
## [1] 1
A[1:2, 1]
## [1] 1 2
A[1:2, 1, drop=F]
##      [,1]
## [1,]    1
## [2,]    2
A[1:2, ]
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
  6. Extract from a data frame.
    • We can use the same operations as for a matrix, but there are a few special cases.
    • The first will extract the first column and return a data.frame.
    • The second will return the first column as a vector.
df[1]
df[[1]]
## [1] 6 5 4

1.1.7 Naming

In a vector or list, we can name our elements. In a matrix or data frame, we can name the rows and columns. This is very useful for referencing values later.

Names can be used in brackets instead of numbers, e.g., vec["value1"]. For lists and data frames you can also use the $ operator to refer to names (for data frames, it refers to column names).

  1. Naming elements in a vector.
    • Will add names to an existing vector (x) using the names() function.
    • Then will extract the element with the name “second”.
x
## [1] 6 5 4
names(x) <- c("first","second","third")
x
##  first second  third 
##      6      5      4
x["second"]
## second 
##      5
  2. Creating a named vector
newx <- c("cat"=1, "dog"=2)
newx
## cat dog 
##   1   2
  3. Assign row names and column names to a data.frame
    • This will add names to an existing data.frame
    • Can extract elements by row and/or column name.
    • For a data.frame we can use the $ operator to select a column (as a vector). NOTE: The $ operator only works for list or data.frame objects; it will not work for a matrix.
rownames(df) <- c("r1","r2","r3")
colnames(df) <- c("size","animal")
df
df["r1","animal"]
## [1] "mouse"
df[c("r1","r3"), ] 
df$animal 
## [1] "mouse" "cat"   "dog"

1.1.8 Factors

A factor is a special type of vector used to store categorical information. For example, storing group metadata. Example:

  • Experiment with wild-type (WT) and two different knock-outs (KO1, KO2).
  • There are two replicates for each genotype, so list each value twice.

Factors are very important when we set up statistical analyses and need to define groups.

f <- factor(c("WT","WT","KO1","KO1","KO2","KO2"))
f
## [1] WT  WT  KO1 KO1 KO2 KO2
## Levels: KO1 KO2 WT

A factor has:

  • A vector with discrete categorical values.
    • Could be numbers or characters, but NOT measurements (continuous values). Only a specific number of unique values.
  • Levels, which is a second vector of the possible unique values.
    • If you add to or modify the factor vector, you can only add values of the existing levels.
    • You can modify the displayed values of the vector by changing level names.
    • You can specify the order of the levels for convenience.
    • You can make it an ordered (ordinal) factor. This allows one to rank the levels.
  1. See the possible levels of a factor. NOTE: This is a character vector with the unique values.
levels(f)
## [1] "KO1" "KO2" "WT"
  2. Change the name of the “WT” level
levels(f)[3] <- "Wild-Type"
f
## [1] Wild-Type Wild-Type KO1       KO1       KO2       KO2      
## Levels: KO1 KO2 Wild-Type
  3. Redefine the levels of f so “WT” is the first level.
f <- factor(c("WT","WT","KO1","KO1","KO2","KO2"), levels=c("WT","KO1","KO2"))
f
## [1] WT  WT  KO1 KO1 KO2 KO2
## Levels: WT KO1 KO2
  4. Create an ordinal or ordered factor.
f2 <- factor(c("time1","time2","time3"), ordered=T)
f2
## [1] time1 time2 time3
## Levels: time1 < time2 < time3
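Because the levels of an ordered factor are ranked, comparison operators become meaningful; a sketch using the same f2 factor:

```r
f2 <- factor(c("time1", "time2", "time3"), ordered=TRUE)
# Comparisons follow the level order
f2[1] < f2[2]
## [1] TRUE
f2 >= "time2"  # compare the whole factor against one level
## [1] FALSE  TRUE  TRUE
```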
  5. Try to add a new value to the factor.
    • The value “KO3” is not a defined level, so R will issue a warning.
    • The value stored will be NA (missing).
f[6] <- "KO3" 
f
## [1] WT   WT   KO1  KO1  KO2  <NA>
## Levels: WT KO1 KO2

A very common use case is to make certain columns of a data frame factors. The data.frame will look the same, but you will now be able to see the possible levels for the column.

df
df[,2] <- factor(df[,2])
df
df[,2]
## [1] mouse cat   dog  
## Levels: cat dog mouse

1.2 Writing functions and using apply

1.2.1 Write a function to find the max value of a vector.

This is just for our practice, as the max() function already exists in R.

Note: Use indentation (tabs) to make code more readable.

get.max <- function(x){
    max <- x[1]
    for (i in x){
        if (i > max){
            max <- i
        }
    }
    return(max)
}
a <- c(23.3, 1, 3, 55, 6)
get.max(a)
## [1] 55

Compare with the built-in max function

max(a)
## [1] 55
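Two related built-ins are worth knowing here: which.max() returns the position of the maximum rather than its value, and range() returns the min and max together. A quick sketch:

```r
a <- c(23.3, 1, 3, 55, 6)
which.max(a)  # index of the largest element
## [1] 4
range(a)      # smallest and largest values
## [1]  1 55
```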

1.2.2 apply statements

For this we will use a built-in data frame in R, mtcars.

  • mtcars contains data extracted from the 1974 Motor Trend magazine: various aspects of car design and performance for 32 cars (1973–74 models).
  1. Get information on mtcars data set.
?mtcars
  2. Get dimensions of the mtcars data.frame
dim(mtcars)
## [1] 32 11
  3. See the first six rows
head(mtcars)
  4. Use an apply function to find the max value for each feature (column)
apply(mtcars, 2, max) 
##     mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear 
##  33.900   8.000 472.000 335.000   4.930   5.424  22.900   1.000   1.000   5.000 
##    carb 
##   8.000
  5. Use the apply function to find which car has the maximum value of each feature.
    • We will define a function WITHIN the apply statement
    • What type of object does this return?
apply(mtcars, 2, function(x) rownames(mtcars)[x==max(x)])
## $mpg
## [1] "Toyota Corolla"
## 
## $cyl
##  [1] "Hornet Sportabout"   "Duster 360"          "Merc 450SE"         
##  [4] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
##  [7] "Lincoln Continental" "Chrysler Imperial"   "Dodge Challenger"   
## [10] "AMC Javelin"         "Camaro Z28"          "Pontiac Firebird"   
## [13] "Ford Pantera L"      "Maserati Bora"      
## 
## $disp
## [1] "Cadillac Fleetwood"
## 
## $hp
## [1] "Maserati Bora"
## 
## $drat
## [1] "Honda Civic"
## 
## $wt
## [1] "Lincoln Continental"
## 
## $qsec
## [1] "Merc 230"
## 
## $vs
##  [1] "Datsun 710"     "Hornet 4 Drive" "Valiant"        "Merc 240D"     
##  [5] "Merc 230"       "Merc 280"       "Merc 280C"      "Fiat 128"      
##  [9] "Honda Civic"    "Toyota Corolla" "Toyota Corona"  "Fiat X1-9"     
## [13] "Lotus Europa"   "Volvo 142E"    
## 
## $am
##  [1] "Mazda RX4"      "Mazda RX4 Wag"  "Datsun 710"     "Fiat 128"      
##  [5] "Honda Civic"    "Toyota Corolla" "Fiat X1-9"      "Porsche 914-2" 
##  [9] "Lotus Europa"   "Ford Pantera L" "Ferrari Dino"   "Maserati Bora" 
## [13] "Volvo 142E"    
## 
## $gear
## [1] "Porsche 914-2"  "Lotus Europa"   "Ford Pantera L" "Ferrari Dino"  
## [5] "Maserati Bora" 
## 
## $carb
## [1] "Maserati Bora"
  6. The tapply function will apply the function across the values (first argument) based on the INDEX groups (second argument). In this example, we will compute the mean mpg separately for each cylinder (cyl) count of the cars.
tapply(mtcars$mpg, mtcars$cyl, mean)
##        4        6        8 
## 26.66364 19.74286 15.10000
  7. The lapply function will run a function across elements in a collection and will return a list with the results.
    • For this example, we will get a list of which cars have different numbers of cylinders
    • First we will get the cylinder counts and sort them
    • Then we will use the cylinder counts to get the names
cyls <- unique(mtcars$cyl)
cyls
## [1] 6 4 8
cyls <- cyls[order(cyls)]
cyls
## [1] 4 6 8
lapply(cyls, function(x) rownames(mtcars)[mtcars$cyl==x])
## [[1]]
##  [1] "Datsun 710"     "Merc 240D"      "Merc 230"       "Fiat 128"      
##  [5] "Honda Civic"    "Toyota Corolla" "Toyota Corona"  "Fiat X1-9"     
##  [9] "Porsche 914-2"  "Lotus Europa"   "Volvo 142E"    
## 
## [[2]]
## [1] "Mazda RX4"      "Mazda RX4 Wag"  "Hornet 4 Drive" "Valiant"       
## [5] "Merc 280"       "Merc 280C"      "Ferrari Dino"  
## 
## [[3]]
##  [1] "Hornet Sportabout"   "Duster 360"          "Merc 450SE"         
##  [4] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
##  [7] "Lincoln Continental" "Chrysler Imperial"   "Dodge Challenger"   
## [10] "AMC Javelin"         "Camaro Z28"          "Pontiac Firebird"   
## [13] "Ford Pantera L"      "Maserati Bora"
  8. The sapply function is similar to lapply, but it tries to simplify the results into a vector.
    • For example, we could modify the previous lapply statement to get a count of the cars per cylinder and return as a list
lapply(cyls, function (x) length(which(mtcars$cyl==x)))
## [[1]]
## [1] 11
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] 14
  • Using sapply will return as a vector
sapply(cyls, function (x) length(which(mtcars$cyl==x)))
## [1] 11  7 14

NOTE: For item 8, there are easier ways to compute these counts. However, we are using these examples to show how the different apply statements work in R.
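One of those easier ways is the base-R table() function, which tallies each unique value in one call; a sketch reproducing the counts from the sapply example:

```r
# Count cars per cylinder value in one call
table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14
```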

1.3 Read and write a file

  1. Read the file https://wd.cri.uic.edu/R/birth_weight.txt into a data frame
  2. Sort the data frame by mother’s age, and write to a new file
  • Data frame columns
    • bwt: baby’s birth weight in ounces (low bwt: < 88 ounces)
    • smoke: 0 - mother is not a smoker, 1 - smoker
    • parity: 0 - child first born, 1 - otherwise
    • gestation: length of pregnancy in days
    • age: mother’s age in years
    • height: mother’s height in inches
    • weight: mother’s pregnancy weight in pounds
bw_data  <- read.delim("https://wd.cri.uic.edu/R/birth_weight.txt")
head(bw_data)
data_ordered_by_age <- bw_data[order(bw_data$age), ]
head(data_ordered_by_age)
write.table(data_ordered_by_age,"birth_weight_ordered_by_age.txt",sep="\t",
   quote=F,row.names=F)
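To check that the file was written correctly, you can read it back with read.delim() and compare. A self-contained sketch using a temporary file and a small made-up data frame (so it does not depend on the workshop data):

```r
demo <- data.frame(age=c(30, 25, 41), bwt=c(120, 99, 130))
out <- tempfile(fileext=".txt")
write.table(demo[order(demo$age), ], out, sep="\t", quote=FALSE, row.names=FALSE)
# Reading it back should give the rows sorted by age
back <- read.delim(out)
back$age
## [1] 25 30 41
```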

1.4 Install CRAN and Bioconductor packages

1.4.1 Check whether selected packages are installed

In RStudio you can select the Packages tab in the lower-right pane to view all installed packages. It is also possible to list them via an R command.

  1. List all installed packages
installed.packages()
  2. Check whether selected packages are installed.
c("ggplot2", "ComplexHeatmap") %in% rownames(installed.packages())
## [1] FALSE FALSE
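The %in% check can be wrapped in a small helper that reports which packages still need installing; a sketch (the missing_pkgs name is ours, not a standard function):

```r
# Return the subset of pkgs that is not yet installed
missing_pkgs <- function(pkgs) {
  pkgs[!pkgs %in% rownames(installed.packages())]
}
missing_pkgs(c("stats", "not_a_real_package"))
## [1] "not_a_real_package"
```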

1.4.2 Install Bioconductor package: “ComplexHeatmap”

install.packages("BiocManager”) 
BiocManager::install("ComplexHeatmap", update=F)
library(ComplexHeatmap)

1.4.3 Install CRAN package: “tidyverse”

If you are asked for a CRAN mirror site, please select one close to your location, e.g. “USA (IN)”. Tidyverse is a collection of packages, including tidyr, dplyr, and ggplot2. Technically each of these is a dependency of the tidyverse package, and so will be installed as well.

  1. Install the package. You only need to do this ONCE!
install.packages("tidyverse")
  2. Load the libraries. You need to do this when you start R and are going to use the libraries.
library(tidyr)
library(dplyr)
library(ggplot2)

1.5 Cleaning and transforming a sample dataset

For this exercise, we will perform some clean-up steps on the mtcars data.

NOTE: Pipes (%>%) to head() are just so that we see the first 6 lines of each output.

1.5.1 Initial setup

  • Load “tidyverse”
library(tidyverse)
  • Take a quick peek at the data.
head(mtcars)
  • Before any of the examples, we will move car names, i.e. the row.names, to a column named “car”
data <- mtcars %>%
  rownames_to_column(var="car")
head(data)

1.5.2 Examples

  1. Example 1: Select only the car, mpg, cyl, and hp columns, and pipe to head() at the end to see the first few lines
data %>%
  select(car, mpg, cyl, hp) %>% head()
  2. Example 2: Rename columns for clarity
data %>%
  rename(miles_per_gallon=mpg,
         horsepower=hp) %>% head()
  3. Example 3: Filter cars with mpg greater than 25
data %>%
  filter(mpg > 25) %>% head()
  4. Example 4: Filter cars with 4 or 6 cylinders
data %>%
  filter(cyl %in% c(4, 6)) %>% head()
  5. Example 5: Create a new variable (column) called power_to_weight using the mutate function.
data %>%
  mutate(power_to_weight=hp / wt) %>% head()
  6. Example 6: Use cut to categorize cars by engine size
    • The breaks are from 0 to 150, 150 to 300, and greater than or equal to 300
    • The labels set the labels for the factor
    • When right is set to FALSE, then the interval will be closed on the left (include the lower number) and open on the right (excludes the right number)
data %>%
  mutate(engine_size=cut(disp, breaks=c(0, 150, 300, Inf),
                           labels=c("small", "medium", "large"),
                           right=F)) %>% head()
  7. Example 7: Average mpg by number of cylinders.
  • The group_by function will group the data based on the column(s) given
  • The summarize function will then apply function(s) on the grouped data
data %>%
  group_by(cyl) %>%
  summarise(mean_mpg=mean(mpg))
  8. Example 8: Count cars per gear group
data %>%
  count(gear)
  9. Example 9: Pivot mpg, hp, and wt into long format.
  • We will just select the car, mpg, hp, and wt columns.
  • cols list the columns to pivot. All other columns will be copied on each row.
  • names_to will specify the name of the column in the long format that will have the original column names.
  • values_to will specify the name of the column in the long format that will have the values.
  • NOTE: The pivot_longer function returns a tibble, which is a more flexible data.frame. Printing a tibble only shows the first few rows, so we don’t need to use head()
data_long <- select(data, c(car, mpg, hp, wt)) %>%
  pivot_longer(cols=c(mpg, hp, wt),
               names_to="variable",
               values_to="value")
data_long
  10. Example 10: Pivot back to wide format
    • names_from specifies the column to use as column names in the wide format
    • values_from specifies the column to use as the values to put in the pivoted columns.
    • All other columns will just appear as is in each row.
data_long %>%
  pivot_wider(names_from=variable, values_from=value)
  11. Example 11: Introduce some NA values
data_with_na <- data %>%
  mutate(mpg=replace(mpg, mpg < 20, NA))
head(data_with_na)
  12. Example 12: Drop rows with NA
data_with_na %>%
  drop_na() %>% head()
  13. Example 13: Replace NA with a fixed value
data_with_na %>%
  replace_na(list(mpg=0)) %>% head()

2 Afternoon

2.1 Data Visualization

2.1.1 Data set

Read in the data for this exercise. This is a table of cell metadata and selected gene expression levels from a single-cell RNA-seq project. The columns in this table are:

  • UMIs: total UMI (unique read) count per cell.
  • Genes: number of genes expressed.
  • Genotype: WT or KO, 2 genotypes profiled in this experiment.
  • Batch: A or B, 2 replicate captures collected in this experiment.
  • PercentMT: Fraction of reads mapping to mitochondrial genes. Higher levels can indicate issues with cell viability. For this data set, cells with % MT > 15% have already been filtered out.
  • HasTCR: Whether or not this cell expressed a TCR, based on a separate TCR library collected for these samples.
  • Cluster: cluster assignment from analysis.
  • UMAP_1 and UMAP_2: UMAP coordinates for each cell.
  • Log-scaled normalized gene expression levels for several genes of interest.
  1. Load the data from the URL
sc <- read.delim("http://wd.cri.uic.edu/R/scRNA_cells.txt", row.names=1)
  2. Convert the Cluster column to a factor
sc$Cluster <- factor(sc$Cluster)
head(sc)

2.1.2 Scatterplots

We will make various versions of a UMAP plot.

Using base R

  1. Basic scatter plot
plot(sc$UMAP_1, sc$UMAP_2)

  2. Colored by cluster - this is kind of clunky
    • We are using factors as a shortcut to make unique colors for the clusters:
      • Make a factor of the cluster assignments.
      • Rename the factor levels as color names (RGB codes). The rainbow function gives an even spacing of colors across the rainbow for a given number of colors.
cell_colors <- factor(sc$Cluster)
levels(cell_colors) <- rainbow(length(levels(cell_colors)))
plot(sc$UMAP_1, sc$UMAP_2, col=cell_colors)

It would be even better to add a legend. This is possible in base R plots, but much easier with ggplot2.
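For reference, base R can add a key with the legend() function; a self-contained sketch with toy data standing in for the UMAP columns (the real plot above would use sc and cell_colors):

```r
# Toy stand-in for the single-cell UMAP data
set.seed(1)
toy <- data.frame(UMAP_1=rnorm(30), UMAP_2=rnorm(30),
                  Cluster=factor(rep(1:3, each=10)))
cols <- rainbow(nlevels(toy$Cluster))
plot(toy$UMAP_1, toy$UMAP_2, col=cols[toy$Cluster])
# legend() draws the key; pch=1 matches the default plotting symbol
legend("topright", legend=levels(toy$Cluster), col=cols, pch=1)
```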

Using ggplot2

ggplot2 makes it much easier to add more features to our plot, like automatically coloring dots based on a categorical label and adding a legend.

  1. Load the ggplot2 library
library(ggplot2)
  2. Make the basic scatter plot (geom_point) using the values from the columns UMAP_1 and UMAP_2 as the x and y coordinates, and color by the values in the column Cluster
ggplot(sc, aes(x=UMAP_1, y=UMAP_2, color=Cluster)) +
  geom_point()

  3. Save the plot to a PDF.
pdf("basic_UMAP_plot.pdf")
ggplot(sc, aes(x=UMAP_1, y=UMAP_2, color=Cluster)) +
  geom_point()
dev.off()
  4. Facet by genotype (facet_wrap)

If you do not see a plot appear, make sure you have run dev.off() above.

ggplot(sc, aes(x=UMAP_1, y=UMAP_2, color=Cluster)) +
  geom_point() +
  facet_wrap( ~ Genotype)

  5. Facet by genotype and batch (facet_grid)
ggplot(sc, aes(x=UMAP_1, y=UMAP_2, color=Cluster)) +
  geom_point() +
  facet_grid( Batch ~ Genotype)

  6. Update the labels and the theme
ggplot(sc, aes(x=UMAP_1, y=UMAP_2, color=Cluster)) +
  geom_point() +
  theme_classic() +
  labs(x="UMAP1", y="UMAP2", color="Cell type", title="UMAP by cluster")

  7. Color by expression of a gene of interest
ggplot(sc, aes(x=UMAP_1, y=UMAP_2, color=Cxcr6)) +
  geom_point()

  • Nicer coloring with viridis palette
ggplot(sc, aes(x=UMAP_1, y=UMAP_2, color=Cxcr6)) +
  geom_point() +
  scale_color_viridis_c()

  8. Color by genotype, using a custom color scheme
ggplot(sc, aes(x=UMAP_1, y=UMAP_2, color=Genotype)) +
  geom_point() +
  scale_color_manual(values=c("WT"="red","KO"="blue"))

2.1.3 Box plots and violin plots

These are both ways to plot the distribution of values.

  1. Show the level of Cxcr6 by cluster and genotype
    • First as a boxplot
ggplot(sc, aes(x=Cluster, y=Cxcr6, fill=Genotype)) +
  geom_boxplot()

  • As a violin plot
ggplot(sc, aes(x=Cluster, y=Cxcr6, fill=Genotype)) +
  geom_violin()

  • Same, but using variables.
baseplot <- ggplot(sc, aes(x=Cluster, y=Cxcr6, fill=Genotype))
baseplot + geom_boxplot()

baseplot + geom_violin()

Violin plot for many genes

This is an example of showing expression for many marker genes over all clusters in a compact visualization.

  1. Make a new data frame with just the clusters, and the gene expression levels
gene_expr <- sc[,c(7,10:17)]
head(gene_expr)
  2. Convert to the “long” format. Load the tidyr library if you haven’t already.
library(tidyr)
gene_expr_long <- pivot_longer(gene_expr, !Cluster)
gene_expr_long
  3. Basic violin plot
    • We are using the faceting to show the different groups, rather than separating along the x-axis
basic_vioplot <- ggplot(gene_expr_long, aes(x=NA,y=value,fill=Cluster)) +
  geom_violin() + facet_grid(Cluster ~ name)
basic_vioplot

  4. Make it look nicer.
    • theme_void() removes the x and y axis tics, labels, and the background grid.
    • coord_flip() makes the violins go left-to-right instead of up and down.
    • guides(fill="none") turns off the legend for the fill color, as this is redundant with the cluster facet labels.
    • theme(strip.text.x=element_text(angle=45)) rotates the x-direction facet labels (gene names) by 45 degrees.
basic_vioplot + 
  theme_void() + 
  coord_flip() +
  guides(fill="none") +
  theme(strip.text.x=element_text(angle=45))

2.1.4 Bar plots

  1. Number of cells per cluster
    • First using geom_bar() to have ggplot2 count for you.
ggplot(sc, aes(x=Cluster)) + geom_bar()

  • Or do the count first and then use geom_col() to plot the values as is.
cluster_counts <- table(sc$Cluster)
cluster_counts
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13 
## 949 891 809 718 687 566 526 486 310 303 212 179 106  97
cluster_counts_df <- data.frame(cluster_counts)
cluster_counts_df
ggplot(cluster_counts_df, aes(x=Var1, y=Freq)) + geom_col()

  2. Cells per cluster, also with genotype and TCR
    • First as a stacked barplot
ggplot(sc, aes(x=Cluster, fill=Genotype)) +
  geom_bar()

  • Side-by-side barplot
ggplot(sc, aes(x=Cluster, fill=Genotype)) +
  geom_bar(position="dodge")

  • Facet by TCR status
ggplot(sc, aes(x=Cluster, fill=Genotype)) +
  geom_bar(position="dodge") +
  facet_wrap(~HasTCR)

2.2 Heatmaps

The test data set contains the top 100 differentially expressed genes (DEGs) from an RNA-seq project with 3 groups (Control, disease model 1, disease model 2). The values in this table are log2 CPM (log-scaled normalized expression).

  1. Load the ComplexHeatmap library and the data.
library(ComplexHeatmap)
degs <- read.delim("http://wd.cri.uic.edu/R/degs.txt",row.names=1)
  2. Look at the top of the file
head(degs)
  3. Compute z-score expression.
  • The scale() function will z-score the data by column. We want to z-score by row, so we transpose twice.
degs.z <- t(scale(t(degs)))
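A quick sanity check on a small synthetic matrix: after t(scale(t(m))), every row should have mean 0 and standard deviation 1. A sketch:

```r
# Toy matrix: 3 "genes" (rows) x 4 "samples" (columns)
m <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40,
              5, 5, 6, 8), nrow=3, byrow=TRUE)
m.z <- t(scale(t(m)))
round(rowMeans(m.z), 10)  # each row mean is 0
## [1] 0 0 0
apply(m.z, 1, sd)         # each row sd is 1
## [1] 1 1 1
```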
  4. Plot a basic heatmap
Heatmap(degs.z, name="Z-score")

  5. Turn off display of row names
Heatmap(degs.z, name="Z-score", show_row_names=F)

  6. Add a color annotation for the groups
    • First will get the groups for each sample. In these data the group names are included in the sample names.
    • Then create a named vector with the specified colors for each group with the name as the group/level and the value as the color.
    • Create a HeatmapAnnotation with the group data and the colors.
    • Finally, generate the heatmap with the annotation on the top.
groups <- gsub("\\.[0-9]*$","",colnames(degs))
groups
##  [1] "Control" "Control" "Control" "Control" "Model1"  "Model1"  "Model1" 
##  [8] "Model1"  "Model2"  "Model2"  "Model2"  "Model2"
group_colors <- c("Control"="blue","Model1"="orange","Model2"="purple")
column_label <- HeatmapAnnotation(df=data.frame(Group=groups),
  col=list(Group=group_colors), which="column")
Heatmap(degs.z, name="Z-score", show_row_names=F, top_annotation=column_label)

2.3 PCA

We’ll use the same RNA-seq data as above. NOTE: usually you would compute PCA over all genes, rather than just the top ones as in this example.

  1. Compute the PCA ordination data using the prcomp() function.
    • Like scale(), prcomp() will use the columns as the data (features).
    • In these data, the features (genes) are on the rows, so we need to transpose.
pca <- prcomp(t(degs))
  2. We can use the summary() function to calculate summary statistics for the PCA
summary(pca)
## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     20.0719 3.28916 1.09160 1.02606 0.90698 0.80002 0.71785
## Proportion of Variance  0.9614 0.02582 0.00284 0.00251 0.00196 0.00153 0.00123
## Cumulative Proportion   0.9614 0.98720 0.99004 0.99256 0.99452 0.99605 0.99728
##                            PC8     PC9    PC10    PC11      PC12
## Standard deviation     0.69225 0.53578 0.47392 0.38804 2.653e-15
## Proportion of Variance 0.00114 0.00069 0.00054 0.00036 0.000e+00
## Cumulative Proportion  0.99842 0.99910 0.99964 1.00000 1.000e+00
  3. We can save the summary as an R object (list)
    • We can view the available elements in the summary using names().
    • The importance element is a data.frame with the statistics for each PC axis
pca_stats <- summary(pca)
names(pca_stats) 
## [1] "sdev"       "rotation"   "center"     "scale"      "x"         
## [6] "importance"
importance <- pca_stats$importance
importance
##                             PC1      PC2      PC3      PC4       PC5       PC6
## Standard deviation     20.07188 3.289157 1.091597 1.026061 0.9069831 0.8000232
## Proportion of Variance  0.96138 0.025820 0.002840 0.002510 0.0019600 0.0015300
## Cumulative Proportion   0.96138 0.987200 0.990040 0.992560 0.9945200 0.9960500
##                              PC7       PC8       PC9      PC10     PC11
## Standard deviation     0.7178452 0.6922487 0.5357828 0.4739224 0.388043
## Proportion of Variance 0.0012300 0.0011400 0.0006900 0.0005400 0.000360
## Cumulative Proportion  0.9972800 0.9984200 0.9991000 0.9996400 1.000000
##                                PC12
## Standard deviation     2.653185e-15
## Proportion of Variance 0.000000e+00
## Cumulative Proportion  1.000000e+00
  4. Add the group metadata to the PC coordinates
pca_coords <- data.frame(pca$x)
head(pca_coords)
pca_coords$Group <- groups
  5. Create the PCA plot. We will add the percent explained variance for each PC in the axis labels. This comes from the importance data.frame created in step 3.
ggplot(data=pca_coords, aes(x=PC1, y=PC2, color=Group)) +
    geom_point() +
    theme_classic() +
    labs(x=paste0("PC1 (", round(100*importance[2,1], 1), "%)"),
         y=paste0("PC2 (", round(100*importance[2,2], 1), "%)"))

  6. Generate a scree plot
screeplot(pca)

  7. OPTIONAL: Scree plot using ggplot2. We will need to get the importance table into a format for ggplot2.
    • We will create a data.frame with each PC axis on a row.
    • The axis column will be a factor with the levels by PC order.
    • For simplicity, we are renaming “Standard deviation”, “Proportion of Variance” and “Cumulative Proportion” to “sdev”, “var” and “cum_var”, respectively
import_df <- as.data.frame(t(importance)) %>% 
  rownames_to_column("axis") %>%
  mutate(axis=factor(axis, levels=colnames(importance)))

colnames(import_df)[2:4] <- c("sdev", "var", "cum_var")

ggplot(import_df, aes(x=axis, y=var)) + 
  geom_col(fill="gray", color="black") + 
  labs(y="Proportion of Variance") +
  theme_classic()

NOTE: although all three groups look well-separated in the PC1 vs PC2 plot, the biggest differences are with respect to Model1: it is separated along PC1, which explains the vast majority of the variance. This is consistent with what we saw in the heatmap.

2.3.1 PCA plot with sample labels (TAKE-HOME)

For this we’ll use the ggrepel package, which does a good job of fitting text labels into a plot.

Install the package first:

install.packages("ggrepel")
  1. Load the ggrepel library
library(ggrepel)
  2. Add the sample names to our data frame
pca_coords$Sample <- rownames(pca_coords)
  3. Create the PCA plot, with label aesthetics and a geom_text_repel layer. Set show.legend=F to avoid a duplicate legend for the text colors.
ggplot(data=pca_coords, aes(x=PC1, y=PC2, color=Group, label=Sample)) +
    geom_point() +
    theme_classic() +
    geom_text_repel(show.legend=F) +
    labs(x=paste0("PC1 (", round(100*importance[2,1], 1), "%)"),
         y=paste0("PC2 (", round(100*importance[2,2], 1), "%)"))

2.4 Differences between groups (t-test, ANOVA)

2.4.1 Data set

We will use the birth weight data from earlier today. Read that back in if needed.

bw_data <- read.delim("https://wd.cri.uic.edu/R/birth_weight.txt")

2.4.2 Check normality: Shapiro test

Caution: a large sample size will usually result in a significant p-value from Shapiro test, because any tiny deviation from normality will be detected. We should also look at how far W is from 1 to see how big the deviation is. For reasonably small differences, the normality assumption is probably still OK.
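As a hedged illustration of this caution (simulated data, not the workshop data set), a large sample can make shapiro.test() highly significant even when W stays very close to 1:

```r
# Toy example: 5000 values that are mostly normal, with mild contamination.
# The deviation from normality is tiny, but the large n makes it easy to
# detect, so judge the size of the deviation from W, not from p alone.
set.seed(1)
x <- c(rnorm(4900), runif(100, -3, 3))
shapiro.test(x)
```

Note that shapiro.test() accepts at most 5000 values, which is why the example stops there.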

  1. First we will get a sense of the sample size (n) and plot a histogram
length(bw_data$bwt)
## [1] 1174
hist(bw_data$bwt)

  2. Run the Shapiro test. Look at the values of W and p
shapiro.test(bw_data$bwt)
## 
##  Shapiro-Wilk normality test
## 
## data:  bw_data$bwt
## W = 0.99563, p-value = 0.001917
  3. Use a QQ-normal plot (qqnorm()) to check normality.
  • If the data is normally distributed, the points in the QQ-normal plot will lie on a straight diagonal line.
  • The qqline() function will show the line for a “theoretical” normal distribution
qqnorm(bw_data$bwt)
qqline(bw_data$bwt, col="red")

Alternative data: run the tests on raw counts from RNA-seq

  1. Get raw sequence data
raw.count <- read.delim("http://wd.cri.uic.edu/advanced_R/raw.count.txt",row.names=1)
  2. Convert the first gene’s raw read counts to a numeric vector
count <- as.numeric(raw.count[1, ])
  3. Check the number of samples and plot a histogram
length(count)
## [1] 46
hist(count)

  4. Run the Shapiro test
shapiro.test(count)
## 
##  Shapiro-Wilk normality test
## 
## data:  count
## W = 0.49027, p-value = 2.014e-11
  5. Create the Q-Q plot
qqnorm(count)
qqline(count, col="red")

The birth weight values are close to normal; the raw RNA-seq counts are not.
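As a hedged aside (toy simulated counts, not the workshop file): raw counts like these are usually strongly right-skewed, which is one reason RNA-seq counts are typically log-transformed before methods that assume normality:

```r
# Simulated overdispersed counts (negative binomial), similar in spirit
# to a raw RNA-seq gene: strongly right-skewed
set.seed(7)
counts <- rnbinom(46, mu = 50, size = 0.5)
hist(counts)
# A log transform usually makes the distribution much less skewed;
# compare the two histograms (and, if you like, the two Shapiro tests)
hist(log2(counts + 1))
```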

2.4.3 t-test and Wilcox test

Test birth weight vs mother’s smoking status.

  1. Make the “smoke” column a factor for easier handling.
bw_data$smoke <- factor(bw_data$smoke)
  2. Run Student’s t-test
  • If one just runs the function, it will print a report
t.test(bwt ~ smoke, data=bw_data)
## 
##  Welch Two Sample t-test
## 
## data:  bwt by smoke
## t = 8.6265, df = 941.81, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   7.158132 11.374153
## sample estimates:
## mean in group 0 mean in group 1 
##        123.0853        113.8192
  • The output can be saved to a variable to extract particular attributes of the results.
ttest_result <- t.test(bwt ~ smoke, data=bw_data)
ttest_result
## 
##  Welch Two Sample t-test
## 
## data:  bwt by smoke
## t = 8.6265, df = 941.81, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   7.158132 11.374153
## sample estimates:
## mean in group 0 mean in group 1 
##        123.0853        113.8192
ttest_result$p.value
## [1] 2.656464e-17
ttest_result$conf.int
## [1]  7.158132 11.374153
## attr(,"conf.level")
## [1] 0.95
  3. Run the Wilcoxon, a.k.a. Mann-Whitney, test

    As with t.test(), one can save the output to a variable to extract particular attributes.

wilcox.test(bwt ~ smoke, data=bw_data)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  bwt by smoke
## W = 212970, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
wilcox_result <- wilcox.test(bwt ~ smoke, data=bw_data)
wilcox_result$p.value
## [1] 6.485236e-18
  4. Generate a box plot to visualize.
ggplot(bw_data, aes(x=smoke, y=bwt)) +
  geom_boxplot()

  5. OPTIONAL: Add the results from the tests to the plot.
    • We will use the signif() function to round to 3 significant figures.
ggplot(bw_data, aes(x=smoke, y=bwt)) +
  geom_boxplot() +
  labs(title="Baby's birth weight vs. Mother's smoking status",
       subtitle=paste0("t-test, p=", signif(ttest_result$p.value, 3), 
                       ", Wilcox p=", signif(wilcox_result$p.value, 3)))

2.4.4 One-way ANOVA and Kruskal-Wallis test

Test birth weight against categorized values of gestation.

For this example, we will categorize the data based on the following criteria

  • < 37 weeks: pre-term
  • 37-41 weeks: full-term
  • > 41 weeks: late term
  1. Categorize the data. We will use the cut() function to create a factor from the current gestational age (in days)
    • The breaks are from 0 to 37 * 7, 37 * 7 to 41 * 7, and older than 41 * 7
    • The labels set the labels for the factor
    • When right is set to FALSE, the interval is closed on the left (includes the lower number) and open on the right (excludes the upper number)
library(dplyr)
bw_data <- bw_data %>%
  mutate(gestation_cat=cut(gestation, 
                             breaks=c(0, 37*7, 41*7, Inf), 
                             labels=c("pre term", "full term", "late term"),
                             right=F))
  2. Run the ANOVA
anova <- aov(bwt ~ gestation_cat, data=bw_data)
summary(anova)
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## gestation_cat    2  61546   30773   108.4 <2e-16 ***
## Residuals     1171 332512     284                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As with the t-test and Wilcoxon test, one can save the results to a variable.
In the case of ANOVA, save the output from summary().

  • The p-values for each term are saved in the column named “Pr(>F)”
  • NOTE: The last item in the p-values is NA as it corresponds to the residuals.
anova_result <- summary(anova)
anova_pvalues <- anova_result[[1]][["Pr(>F)"]]
anova_pvalues
## [1] 6.571396e-44           NA
  3. Generate diagnostic plots for the ANOVA
plot(anova)

2.4.5 Kruskal-Wallis test

  1. Run the Kruskal-Wallis test
kruskal.test(bwt ~ gestation_cat, data=bw_data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  bwt by gestation_cat
## Kruskal-Wallis chi-squared = 146.76, df = 2, p-value < 2.2e-16
  2. Create a box plot to visualize
ggplot(bw_data, aes(x=gestation_cat, y=bwt)) +
  geom_boxplot()

2.4.6 Two-way ANOVA

Compare birth weight to categorized gestation and smoking.

  1. Ensure that “smoke” is a factor
bw_data$smoke <- factor(bw_data$smoke)
  2. Run without interaction
summary(aov(bwt ~ gestation_cat + smoke, data=bw_data))
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## gestation_cat    2  61546   30773  114.97 <2e-16 ***
## smoke            1  19338   19338   72.25 <2e-16 ***
## Residuals     1170 313174     268                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  3. Run with interaction. Note that gestation_cat:smoke is the interaction term
summary(aov(bwt ~ gestation_cat * smoke, data=bw_data))
##                       Df Sum Sq Mean Sq F value Pr(>F)    
## gestation_cat          2  61546   30773 115.609 <2e-16 ***
## smoke                  1  19338   19338  72.649 <2e-16 ***
## gestation_cat:smoke    2   2272    1136   4.268 0.0142 *  
## Residuals           1168 310902     266                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  4. Generate a visualization
ggplot(bw_data, aes(x=gestation_cat, fill=smoke, y=bwt)) +
  geom_boxplot()

It appears that the effect of smoking is strongest for earlier term pregnancies. The significance of the dependence of smoking effect on pregnancy term is captured in the interaction term of the model.
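To see what the interaction term captures, here is a minimal simulated sketch (toy data, not the birth weight set) in which the built-in “smoke” effect is larger in one “term” category than the other:

```r
set.seed(42)
toy <- data.frame(
  term  = rep(c("pre", "full"), each = 40),
  smoke = rep(c("no", "yes"), times = 40)
)
# Built-in effect: "smoking" lowers the outcome by 8 in "pre" but only 2 in "full"
shift <- ifelse(toy$smoke == "yes", ifelse(toy$term == "pre", -8, -2), 0)
toy$y <- 120 + shift + rnorm(80, sd = 3)
# The smoke-group gap differs by term: that difference is the interaction
aggregate(y ~ term + smoke, data = toy, FUN = mean)
summary(aov(y ~ term * smoke, data = toy))
```

The term:smoke row of the ANOVA table tests exactly this difference-in-differences.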

2.5 Correlation and linear regression

2.5.1 Correlation

Comparison of mother’s height and weight.

  1. Perform a correlation test using Pearson’s method
cor.test(bw_data$height, bw_data$weight)
## 
##  Pearson's product-moment correlation
## 
## data:  bw_data$height and bw_data$weight
## t = 16.552, df = 1172, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3877305 0.4805332
## sample estimates:
##       cor 
## 0.4352874
  2. Using Spearman’s test
cor.test(bw_data$height, bw_data$weight, method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  bw_data$height and bw_data$weight
## S = 134713533, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5004735
  3. Using Kendall’s tau test
cor.test(bw_data$height, bw_data$weight, method="kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  bw_data$height and bw_data$weight
## z = 18.02, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.3743995
  4. Generate a scatter plot
ggplot(bw_data, aes(x=height, y=weight)) +
  geom_point()

2.5.2 Linear regression

This time, compare birth weight to gestation as a continuous variable (we already compared against categorized gestation) and to other variables.

  1. Perform simple linear regression
summary(lm(bwt ~ gestation, data=bw_data))
## 
## Call:
## lm(formula = bwt ~ gestation, data = bw_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.348 -11.065   0.218  10.101  57.704 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.75414    8.53693   -1.26    0.208    
## gestation     0.46656    0.03054   15.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.74 on 1172 degrees of freedom
## Multiple R-squared:  0.1661, Adjusted R-squared:  0.1654 
## F-statistic: 233.4 on 1 and 1172 DF,  p-value: < 2.2e-16
  2. Generate a scatter plot
ggplot(bw_data, aes(x=gestation, y=bwt)) +
  geom_point()

  3. Add a best-fit line to the scatter plot.
    • se=T will add the standard-error confidence band
ggplot(bw_data, aes(x=gestation, y=bwt)) +
  geom_point() +
  geom_smooth(method="lm", se=T)
## `geom_smooth()` using formula = 'y ~ x'

  4. Perform multiple regression
    • Additive model
summary(lm(bwt ~ gestation + smoke + weight, data=bw_data))
## 
## Call:
## lm(formula = bwt ~ gestation + smoke + weight, data = bw_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.920 -10.759  -0.279   9.743  51.354 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.62648    8.69128  -2.028   0.0428 *  
## gestation     0.44809    0.02936  15.261  < 2e-16 ***
## smoke1       -8.07789    0.96444  -8.376  < 2e-16 ***
## weight        0.11818    0.02267   5.213  2.2e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.07 on 1170 degrees of freedom
## Multiple R-squared:  0.2335, Adjusted R-squared:  0.2315 
## F-statistic: 118.8 on 3 and 1170 DF,  p-value: < 2.2e-16
    • Full model with all interactions
summary(lm(bwt ~ gestation * smoke * weight, data=bw_data))
## 
## Call:
## lm(formula = bwt ~ gestation * smoke * weight, data = bw_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.657 -10.597  -0.062   9.738  47.653 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             -1.118e+02  7.032e+01  -1.590  0.11208   
## gestation                7.857e-01  2.514e-01   3.125  0.00182 **
## smoke1                   2.711e+01  1.093e+02   0.248  0.80409   
## weight                   9.945e-01  5.297e-01   1.877  0.06070 . 
## gestation:smoke1        -1.197e-01  3.912e-01  -0.306  0.75965   
## gestation:weight        -3.140e-03  1.895e-03  -1.657  0.09777 . 
## smoke1:weight           -7.161e-01  8.342e-01  -0.858  0.39082   
## gestation:smoke1:weight  2.520e-03  2.984e-03   0.845  0.39848   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.99 on 1166 degrees of freedom
## Multiple R-squared:  0.2431, Adjusted R-squared:  0.2386 
## F-statistic: 53.51 on 7 and 1166 DF,  p-value: < 2.2e-16

The overall explained variance is about the same, but with more terms in the model the explanatory value of each individual variable is lower. We will test the effect of each variable using ANOVA.

summary(aov(bwt ~ gestation * smoke * weight, data=bw_data))
##                          Df Sum Sq Mean Sq F value   Pr(>F)    
## gestation                 1  65450   65450 255.871  < 2e-16 ***
## smoke                     1  19533   19533  76.365  < 2e-16 ***
## weight                    1   7015    7015  27.425 1.94e-07 ***
## gestation:smoke           1   3069    3069  11.997 0.000552 ***
## gestation:weight          1    538     538   2.104 0.147137    
## smoke:weight              1     18      18   0.072 0.788345    
## gestation:smoke:weight    1    182     182   0.713 0.398481    
## Residuals              1166 298252     256                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  5. Generate different plots for comparison of variables
ggplot(bw_data, aes(x=gestation, y=bwt, color=smoke)) +
  geom_point() +
  geom_smooth(method="lm", se=T)
## `geom_smooth()` using formula = 'y ~ x'

ggplot(bw_data, aes(x=weight, y=bwt, color=smoke)) +
  geom_point() +
  geom_smooth(method="lm", se=T)
## `geom_smooth()` using formula = 'y ~ x'

2.6 Fisher’s Exact and Chi-squared tests

2.6.1 Fisher’s Exact test

Test for an association of smoking and parity.

  1. Make sure both are factors. For this example, we will use the mutate() function from the dplyr package to do this in one command.
bw_data <- bw_data %>% 
  mutate(smoke=factor(smoke), 
         parity=factor(parity))
  2. Take a peek at the data, then use the table() function to generate the contingency table.
head(bw_data[, c("smoke","parity")])
smoke_vs_parity_table <- table(bw_data[, c("smoke", "parity")])
smoke_vs_parity_table
##      parity
## smoke   0   1
##     0 525 190
##     1 341 118
  3. Run Fisher’s Exact test. First run it to see the report, then save the result to a variable
fisher.test(smoke_vs_parity_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  smoke_vs_parity_table
## p-value = 0.7857
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7245511 1.2590005
## sample estimates:
## odds ratio 
##  0.9561918
fet <- fisher.test(smoke_vs_parity_table)
  4. Compute the log2 odds ratio
log2(fet$estimate)
## odds ratio 
## -0.0646281
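For intuition, the simple sample odds ratio can be computed directly from the 2x2 counts shown above. Note that fisher.test() reports a conditional maximum-likelihood estimate, which is close to, but not exactly, this cross-product ratio:

```r
# Counts copied from smoke_vs_parity_table above
tab <- matrix(c(525, 341, 190, 118), nrow = 2,
              dimnames = list(smoke = c("0", "1"), parity = c("0", "1")))
# Cross-product (sample) odds ratio: close to the fisher.test estimate (~0.956)
(tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
```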
  5. Generate visualizations.
  • First plot the total counts.
  • Then plot the proportions. This is what we are testing.
ggplot(bw_data, aes(x=smoke, fill=parity)) +
  geom_bar()

ggplot(bw_data, aes(x=smoke, fill=parity)) +
  geom_bar(position="fill") +
  labs(y="Fraction")

No effect observed here.

2.6.2 Chi-square test

Test for association of smoking and gestation category

If you have not done so already, run the categorization

bw_data <- bw_data %>%
  mutate(gestation_cat=cut(gestation, 
                             breaks=c(0, 37*7, 41*7, Inf), 
                             labels=c("pre term", "full term", "late term"),
                             right=F))
  1. Run the Chi-square test
    • As with Fisher’s Exact test, first create a contingency table.
    • Then run chisq.test() on the contingency table
smoke_vs_gest_table <- table(bw_data[,c("smoke","gestation_cat")])
chisq.test(smoke_vs_gest_table)
## 
##  Pearson's Chi-squared test
## 
## data:  smoke_vs_gest_table
## X-squared = 9.8147, df = 2, p-value = 0.007392
  2. Generate a visualization.
ggplot(bw_data, aes(x=smoke, fill=gestation_cat)) +
  geom_bar(position="fill") +
  labs(y="Fraction")

2.6.3 Alternative visualization: Mosaic plot (TAKE HOME)

Use the vcd (Visualization of Categorical Data) package.

  1. If you have not done so already, install the vcd package
install.packages("vcd")

The mosaic() function is built on base R plots. We can provide it directly with the contingency table. The shading options will shade based on over- or under-representation of each group based on the residuals from the Chi-square test.

library(vcd)
mosaic(smoke_vs_gest_table, shade=T, legend=T, gp=shading_binary)

2.6.4 Post-hoc for Chi-square test using Fisher’s Exact (TAKE HOME)

We have a significant difference from Chi-square. Which groups show the differences? Visually it looks like the full term and late term groups have different proportions of smokers.

Test for individual differences with Fisher’s Exact test.

# number of term groups (columns) to test
cols <- ncol(smoke_vs_gest_table)
# start an empty data frame
term_stats <- data.frame()
# run a test for each column (term group) one at a time
for(i in 1:cols){
  # get the counts for this term group
  term.counts <- smoke_vs_gest_table[,i]
  # get the counts for all other term groups
  term.other <- rowSums(smoke_vs_gest_table[,-i])
  # we're specifically seeing if THIS term group has a different distribution
  # of smoking vs non-smoking mothers, using fisher's exact test
  term.table <- cbind(term.counts, term.other)
  fet <- fisher.test(term.table)
  # combine the estimate (odds ratio) and p-value into the data frame
  # and use the term group as the row name
  term_stats <- rbind(term_stats, c(fet$estimate, fet$p.value) )
  rownames(term_stats)[nrow(term_stats)] <- colnames(smoke_vs_gest_table)[i]
}
# set the column names, and add log2-scaled odds ratio and FDR corrected p-value
colnames(term_stats) <- c("OddsRatio","P.Value")
term_stats$Log2OddsRatio <- log2(term_stats$OddsRatio)
term_stats$Q.Value <- p.adjust(term_stats$P.Value,method="fdr")
term_stats

We see statistically significant differences in mother smoking frequency for full term and late term.

We will discuss the FDR correction (used above) next.

2.7 False discovery rate

  1. Randomly generate vector wt (size=20) with mean 10 and standard deviation 3.
  2. Randomly generate vector ko (size=20) with mean 10 and standard deviation 3.
  3. Use t-test to test if wt is different from ko, and get the p-value for the test.
  4. Repeat this procedure 10000 times.
  5. How many tests give \(p < 0.05\)?
  6. Calculate the false discovery rate, a.k.a. FDR or q-value.

NOTE: rnorm simulates n values (below, 20) from the normal distribution with a given mean and standard deviation.

# start empty vector
pval <- c()
# generate 10k random data sets from the
# SAME normal distribution
for (i in 1:10000){
        wt <- rnorm(20, mean=10, sd=3)
        ko <- rnorm(20, mean=10, sd=3)
        pval[i] <- t.test(wt, ko)$p.value
}
  • How many p-values are \(< 0.05\)?
sum(pval<0.05)
## [1] 498
summary(pval)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000265 0.2520704 0.4971856 0.4977919 0.7463348 0.9999326
  • Perform the FDR correction, count the corrected p-values that are \(< 0.05\), and get a summary
fdr <- p.adjust(pval, method="fdr")
sum(fdr<0.05)
## [1] 0
summary(fdr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2649  0.9842  0.9918  0.9882  0.9925  0.9999
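p.adjust() with method="fdr" applies the Benjamini-Hochberg procedure. As a sketch of the underlying computation (for intuition only; use p.adjust() in practice): scale each sorted p-value by n/rank, then enforce monotonicity from the largest p-value downward.

```r
# Manual Benjamini-Hochberg on a small set of p-values
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)
n <- length(p)
o  <- order(p, decreasing = TRUE)  # largest p first
ro <- order(o)                     # to restore the original order
q  <- pmin(1, cummin(n / (n:1) * p[o]))[ro]
all.equal(q, p.adjust(p, method = "fdr"))
## [1] TRUE
```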

3 Take-home exercises

3.1 Read/write data from/to Excel spreadsheets

NOTE: If you have not previously installed openxlsx then install from CRAN.

install.packages("openxlsx")

3.1.1 Read/write Excel spreadsheets using openxlsx

  1. Read the first sheet (tab)
library(openxlsx)
sheet_1 <- read.xlsx("http://wd.cri.uic.edu/advanced_R/taxa_relative.xlsx", sheet=1)
sheet_1[1:10, 1:5]
  2. Load the sheet (tab) named “L6”
sheet_L6 <- read.xlsx("http://wd.cri.uic.edu/advanced_R/taxa_relative.xlsx", sheet="L6")
sheet_L6[1:10, 1:5]
  3. Write a data frame to an Excel spreadsheet
bw_data <- read.delim("https://wd.cri.uic.edu/R/birth_weight.txt")
# Create a workbook and add an empty sheet
wb <- createWorkbook()
addWorksheet(wb, "birth weight")
# Write the birth weight data to the sheet
writeData(wb, sheet="birth weight", x=bw_data)
# Save the workbook to a file
saveWorkbook(wb, "birth_weight.xlsx", overwrite=TRUE)